gradient distribution
Understanding Gradient Clipping in Private SGD: A Geometric Perspective
Deep learning models are increasingly popular in many machine learning applications where the training data may contain sensitive information. To provide formal and rigorous privacy guarantees, many learning systems now incorporate differential privacy by training their models with (differentially) private SGD. A key step in each private SGD update is gradient clipping, which shrinks the gradient of an individual example whenever its ℓ2 norm exceeds a certain threshold. We first demonstrate how gradient clipping can prevent SGD from converging to a stationary point. We then provide a theoretical analysis of private SGD with gradient clipping. Our analysis fully characterizes the clipping bias on the gradient norm, which can be upper bounded by the Wasserstein distance between the gradient distribution and a geometrically symmetric distribution. Our empirical evaluation further suggests that the gradient distributions along the trajectory of private SGD indeed exhibit such a symmetric structure. Together, our results explain why private SGD with gradient clipping remains effective in practice despite its potential clipping bias. Finally, we develop a new perturbation-based technique that can provably correct the clipping bias even for instances with highly asymmetric gradient distributions.
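For concreteness, here is a minimal NumPy sketch of the per-example clipping step the abstract describes; the function names, toy gradients, and hyperparameters are illustrative assumptions rather than the paper's setup, and privacy accounting is omitted.

```python
import numpy as np

def clip(g, C):
    """Shrink g to have l2 norm at most C; shorter gradients pass through."""
    norm = np.linalg.norm(g)
    return g if norm <= C else g * (C / norm)

def private_sgd_step(w, per_example_grads, C, sigma, lr, rng):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian
    noise calibrated to the clipping threshold C, then average and step."""
    B = len(per_example_grads)
    clipped = [clip(g, C) for g in per_example_grads]
    noise = rng.normal(0.0, sigma * C, size=w.shape)
    g_hat = (np.sum(clipped, axis=0) + noise) / B
    return w - lr * g_hat

rng = np.random.default_rng(0)
w = np.zeros(3)
grads = [rng.normal(size=3) * 5.0 for _ in range(8)]  # hypothetical per-example gradients
w = private_sgd_step(w, grads, C=1.0, sigma=1.0, lr=0.1, rng=rng)
```

Note that clipping happens per example, before aggregation: this is what bounds each example's influence (sensitivity) on the update, and it is also what introduces the bias the paper analyzes.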
9ecff5455677b38d19f49ce658ef0608-AuthorFeedback.pdf
We thank the reviewers for their positive and constructive feedback. We address several points from the reviews below. The bias-reduction technique in Section 5 is designed for DP-SGD with clipping; when applied to DP-SGD, the update rule is shown below. Typos: thank you for pointing them out; we will correct them.
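The update rule referenced above does not survive in this extract. As a hedged reconstruction: combining standard DP-SGD with clipping and the paper's perturbation-before-clipping idea would give something of the following form, where the symbols ($z_i$, $b$, $\eta_t$) are our own labels rather than the authors' notation:

```latex
% Hypothetical reconstruction: perturb each per-example gradient g_i with
% Gaussian noise z_i before clipping, then add privacy noise to the clipped
% sum as in standard DP-SGD.
\[
w_{t+1} = w_t - \frac{\eta_t}{B}\left(
    \sum_{i=1}^{B} \mathrm{clip}_C\!\big(g_i(w_t) + z_i\big)
    + \mathcal{N}\!\big(0,\,\sigma^2 C^2 I\big)
\right),
\qquad z_i \sim \mathcal{N}(0,\, b^2 I),
\]
\[
\text{where } \mathrm{clip}_C(v) = v \cdot \min\{1,\, C / \lVert v \rVert_2\}.
\]
```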
On Design Principles for Private Adaptive Optimizers
Ganesh, Arun, McMahan, Brendan, Thakurta, Abhradeep
The spherical noise added to gradients in differentially private (DP) training undermines the performance of adaptive optimizers like AdaGrad and Adam, and hence many recent works have proposed algorithms to address this challenge. However, the empirical results in these works focus on simple tasks and models, and the conclusions may not generalize to model training in practice. In this paper we survey several of these variants, develop better theoretical intuition for them, and perform empirical studies comparing them. We find that the common intuition of aiming for unbiased estimates of the second moments of gradients in adaptive optimizers is misguided; instead, a simple technique called scale-then-privatize (which does not achieve unbiased second moments) has more desirable theoretical behavior and outperforms all other variants we study on a small-scale language model training task. We additionally argue that scale-then-privatize makes the added noise a better match for correlated-noise mechanisms, which are more desirable to use in practice.
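To make the ordering concrete, below is a hedged sketch of one reading of scale-then-privatize: per-example gradients are preconditioned by a second-moment estimate before clipping and noising, rather than privatized first and rescaled afterwards. The estimator, names, and hyperparameters here are our assumptions; the paper's exact construction (in particular, how the moment statistics are obtained privately) may differ.

```python
import numpy as np

def scale_then_privatize_step(w, per_example_grads, v, C, sigma, lr, rng,
                              beta2=0.999, eps=1e-8):
    """Scale each per-example gradient by 1/sqrt(v) BEFORE clipping and
    noising, so the privacy noise is added in the preconditioned space."""
    scale = 1.0 / np.sqrt(v + eps)
    scaled = [g * scale for g in per_example_grads]
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12)) for g in scaled]
    noisy = (np.sum(clipped, axis=0)
             + rng.normal(0.0, sigma * C, size=w.shape)) / len(per_example_grads)
    # CAUTION: updating v from raw gradients would leak privacy; a real
    # system must use stale, public, or separately privatized statistics.
    mean_g = np.mean(per_example_grads, axis=0)
    v = beta2 * v + (1.0 - beta2) * mean_g**2
    return w - lr * noisy, v

rng = np.random.default_rng(0)
w, v = np.zeros(4), np.ones(4)
grads = [rng.normal(size=4) for _ in range(8)]
w, v = scale_then_privatize_step(w, grads, v, C=1.0, sigma=1.0, lr=0.1, rng=rng)
```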
Multi-Modal Learning with Bayesian-Oriented Gradient Calibration
Guo, Peizheng, Wang, Jingyao, Guo, Huijie, Li, Jiangmeng, Sun, Chuxiong, Zheng, Changwen, Qiang, Wenwen
Multi-Modal Learning (MML) integrates information from diverse modalities to improve predictive accuracy. However, existing methods mainly aggregate gradients with fixed weights and treat all dimensions equally, overlooking the intrinsic gradient uncertainty of each modality. This may lead to (i) excessive updates in sensitive dimensions, degrading performance, and (ii) insufficient updates in less sensitive dimensions, hindering learning. To address this issue, we propose BOGC-MML, a Bayesian-Oriented Gradient Calibration method for MML that explicitly models gradient uncertainty and guides the optimization toward the optimal direction. Specifically, we first model each modality's gradient as a random variable and derive its probability distribution, capturing the full uncertainty in the gradient space. Then, we propose an effective method that converts the precision (inverse variance) of each gradient distribution into a scalar evidence value, which quantifies the confidence of each modality in every gradient dimension. Using these evidence values, we explicitly quantify per-dimension uncertainties and fuse them via a reduced Dempster-Shafer rule. The resulting uncertainty-weighted aggregation produces a calibrated update direction that balances sensitivity and conservatism across dimensions. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and advantages of the proposed method.
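As a rough illustration of the precision-as-evidence idea, the sketch below estimates per-dimension gradient variance for each modality from a few stochastic draws and fuses the modality gradients with normalized precision weights. This plain normalization stands in for the reduced Dempster-Shafer combination actually used by BOGC-MML; all names and the sampling setup are our assumptions.

```python
import numpy as np

def fuse_modal_gradients(grad_samples_per_modality, eps=1e-8):
    """grad_samples_per_modality: list with one (num_samples, dim) array of
    stochastic gradient draws per modality. Returns a fused (dim,) update."""
    means, evidences = [], []
    for samples in grad_samples_per_modality:
        means.append(samples.mean(axis=0))
        evidences.append(1.0 / (samples.var(axis=0) + eps))  # precision as evidence
    E = np.stack(evidences)                      # (num_modalities, dim)
    weights = E / E.sum(axis=0, keepdims=True)   # per-dimension confidence weights
    return (weights * np.stack(means)).sum(axis=0)

rng = np.random.default_rng(1)
audio = rng.normal(1.0, 0.1, size=(16, 4))   # low-variance (confident) modality
video = rng.normal(-1.0, 2.0, size=(16, 4))  # high-variance (uncertain) modality
g = fuse_modal_gradients([audio, video])     # fused direction leans toward audio
```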
SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
Ma, Chao, Gong, Wenbo, Scetbon, Meyer, Meeds, Edward
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer state throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer: it does not track state variables during training and consequently achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD's gradients in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving $\approx 50\%$ reduction in total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training LLaMA models with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity using half as many tokens.
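Below is a hedged sketch of the two pre-processing steps the abstract names, applied to a single weight matrix's gradient. SWAN's exact operators, and in particular its efficient computation of the inverse square root, may differ; this version uses a direct eigendecomposition for clarity, and all names are ours.

```python
import numpy as np

def normalize(G, eps=1e-8):
    """Row-wise standardization, intended to stabilize the gradient distribution."""
    return (G - G.mean(axis=1, keepdims=True)) / (G.std(axis=1, keepdims=True) + eps)

def whiten(G, eps=1e-8):
    """Left-multiply by (G G^T)^(-1/2), counteracting local curvature."""
    S = G @ G.T + eps * np.eye(G.shape[0])
    vals, vecs = np.linalg.eigh(S)               # S is symmetric positive definite
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return inv_sqrt @ G

def swan_like_step(W, G, lr):
    """Stateless update: both operators use only the instantaneous gradient G,
    so nothing persists between steps (no moments, no momentum buffers)."""
    return W - lr * whiten(normalize(G))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
G = rng.normal(size=(8, 16))                     # stand-in stochastic gradient
W = swan_like_step(W, G, lr=1e-2)
```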
Data-Driven Gradient Optimization for Field Emission Management in a Superconducting Radio-Frequency Linac
Goldenberg, Steven, Ahammed, Kawser, Carpenter, Adam, Li, Jiang, Suleiman, Riad, Tennant, Chris
Jefferson Lab's Continuous Electron Beam Accelerator Facility (CEBAF) [1] relies on two superconducting radio-frequency linear accelerators (SRF linacs) to deliver high-energy electron beams to nuclear physics experiments in the four experimental halls [2]. Cryomodules, which contain multiple SRF cavities, are an integral part of these linacs. These SRF cavities provide the main accelerating gradients to the electron beam, and currently produce the 12 GeV beam necessary for scientific discovery. However, since the energy upgrade, CEBAF has suffered from significant field emission (FE) induced radiation. With RF on, dose rates observed at 30 cm from the beamline are as high as 10 rem/h and 100 rem/h for neutron and gamma radiation, respectively. This level of radiation causes significant damage to beamline components, including vacuum valves, magnets, and cables of beam position monitors and ion pumps. Replacing these components can use significant resources. Worse, portions of both linacs are considered "Radiation Areas" for days or even weeks into scheduled downtime, limiting maintenance activities to…